AMBER: Automatic Supervision for Multi-Attribute Extraction
نویسندگان
چکیده
The extraction of multi-attribute objects from the deep web is the bridge between the unstructured web and structured data. Existing approaches either induce wrappers from a set of human-annotated pages or leverage repeated structures on the page without supervision. What the former lack in automation, the latter lack in accuracy. Thus accurate, automatic multi-attribute object extraction has remained an open challenge. AMBER overcomes both limitations through mutual supervision between the repeated structure and automatically produced annotations. Previous approaches based on automatic annotations have suffered from low quality due to the inherent noise in the annotations and have attempted to compensate by exploring multiple candidate wrappers. In contrast, AMBER compensates for this noise by integrating repeated structure analysis with annotation-based induction: The repeated structure limits the search space for wrapper induction, and conversely, annotations allow the repeated structure analysis to distinguish noise from relevant data. Both, low recall and low precision in the annotations are mitigated to achieve almost human quality (> 98%) multiattribute object extraction. To achieve this accuracy, AMBER needs to be trained once for an entire domain. AMBER bootstraps its training from a small, possibly noisy set of attribute instances and a few unannotated sites of the domain.
منابع مشابه
Object-Oriented Method for Automatic Extraction of Road from High Resolution Satellite Images
As the information carried in a high spatial resolution image is not represented by single pixels but by meaningful image objects, which include the association of multiple pixels and their mutual relations, the object based method has become one of the most commonly used strategies for the processing of high resolution imagery. This processing comprises two fundamental and critical steps towar...
متن کاملSelf-Crowdsourcing Training for Relation Extraction
One expensive step when defining crowdsourcing tasks is to define the examples and control questions for instructing the crowd workers. In this paper, we introduce a self-training strategy for crowdsourcing. The main idea is to use an automatic classifier, trained on weakly supervised data, to select examples associated with high confidence. These are used by our automatic agent to explain the ...
متن کاملAutomatic Emphatic Information Extraction from Aligned Acoustic Data and Its Application on Sentence Compression
We introduce a novel method to extract and utilize the semantic information from acoustic data. By automatic Speech-ToText alignment techniques, we are able to detect word-based acoustic durations that can prosodically emphasize specific words in an utterance. We model and analyze the sentencebased emphatic patterns by predicting the emphatic levels using only the lexical features, and demonstr...
متن کاملEntity Analysis with Weak Supervision: Typing, Linking, and Attribute Extraction
Entity Analysis with Weak Supervision: Typing, Linking, and Attribute Extraction Xiao Ling Chair of the Supervisory Committee: Professor Daniel S. Weld Computer Science and Engineering With the advent of the Web, textual information has grown at an explosive rate. To digest this enormous amount of data, an automatic solution, Information Extraction (IE), has become necessary. Information extrac...
متن کاملA Comparative Study of Multi-Attribute Continuous Double Auction Mechanisms
Auctions have been as a competitive method of buying and selling valuable or rare items for a long time. Single-sided auctions in which participants negotiate on a single attribute (e.g. price) are very popular. Double auctions and negotiation on multiple attributes create more advantages compared to single-sided and single-attribute auctions. Nonetheless, this adds the complexity of the auctio...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1210.5984 شماره
صفحات -
تاریخ انتشار 2012